
[TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python#10273

Merged
QiJune merged 33 commits into NVIDIA:main from lancelly:unified_python_scheduler
Jan 20, 2026
Conversation

@lancelly (Collaborator)

@lancelly lancelly commented Dec 24, 2025

This PR is the first step in refactoring the scheduler.
Goal:

  • Re-implement the existing C++ MicroBatchScheduler and CapacityScheduler logic in Python 1:1, without architectural changes.

Deliverables:

  • PyMicroBatchScheduler & PyCapacityScheduler classes.
  • Integration into PyExecutor behind the TLLM_USE_PYTHON_SCHEDULER feature flag.
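A minimal sketch of how such a flag gate might look. Only the flag name (TLLM_USE_PYTHON_SCHEDULER) and SimpleUnifiedScheduler come from this PR; the create_scheduler function and CppSchedulerAdapter are invented stand-ins for illustration:

```python
import os


class SimpleUnifiedScheduler:
    """Stand-in for the Python scheduler introduced by this PR."""


class CppSchedulerAdapter:
    """Invented stand-in for the existing C++ scheduler path."""


def create_scheduler():
    # The PR gates the new Python path behind the TLLM_USE_PYTHON_SCHEDULER
    # environment variable; the rest of this wiring is illustrative only.
    if os.environ.get("TLLM_USE_PYTHON_SCHEDULER") == "1":
        return SimpleUnifiedScheduler()
    return CppSchedulerAdapter()
```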

The overhead of the Python scheduler appears acceptable even in scenarios where host overhead is the bottleneck. The benchmark results are:

  • E2E benchmarks of GPT-OSS-120B:
    • Config: max throughput for GB200 + GPT-OSS-120B + Agg
    • Server: ADP2 + max_batch_size 1536
    • Client: max-concurrency 3072 + isl_1k_osl_1k + num_prompts 30720
    • Result:
      • Output token throughput shows a gap of around 1.3% after 50 runs.
      • Schedule time accounts for approximately 4.4% of host_step_time when using the C++ scheduler. The average per-iteration host_step_time is 45.11 ms with the C++ scheduler and 46.02 ms with the Python scheduler, an increase of 2.0%.
  • E2E benchmarks of Llama-3.2-1B:
    • Config: GB200 + Llama-3.2-1B + Agg
    • Server: ADP2 + max_batch_size 1536
    • Client: max-concurrency 3072 + isl_1k_osl_1k + num_prompts 30720
    • Result:
      • Output token throughput shows a gap of around 1.4% after 50 runs.
      • Schedule time accounts for approximately 4.45% of host_step_time when using the C++ scheduler. The average per-iteration host_step_time is 44.53 ms with the C++ scheduler and 45.31 ms with the Python scheduler, an increase of 1.75%.

Details can be found in: Unified Python SPMD Scheduler Execution Plan & Performance Strategy
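The reported overhead percentages are internally consistent with the per-iteration times; a quick check (helper name invented):

```python
def pct_increase(base_ms, new_ms):
    """Relative increase of new_ms over base_ms, in percent."""
    return (new_ms - base_ms) / base_ms * 100


# host_step_time: C++ scheduler vs. Python scheduler, from the PR description
gpt_oss = pct_increase(45.11, 46.02)  # GPT-OSS-120B, reported as ~2.0%
llama = pct_increase(44.53, 45.31)    # Llama-3.2-1B, reported as ~1.75%
```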

QiJune and others added 22 commits December 17, 2025 13:37
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly lancelly requested review from a team as code owners December 24, 2025 09:20
@lancelly lancelly requested a review from HuiGao-NV December 24, 2025 09:20
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@lancelly lancelly requested review from QiJune and litaotju December 24, 2025 09:22
@tensorrt-cicd (Collaborator)

PR_Github #29796 [ run ] triggered by Bot. Commit: 411c254

@coderabbitai (Contributor)

coderabbitai bot commented Dec 24, 2025

📝 Walkthrough

These changes extend Python bindings for GenLlmReq and KVCacheManager C++ classes, add an environment variable to enable Python-based scheduling, and introduce a comprehensive Python scheduling framework with capacity and micro-batch scheduling policies as an alternative to C++ scheduler components.

Changes

  • C++ Pybind/Nanobind Bindings (cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp, cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp): Expose new GenLlmReq Python methods: get_unique_tokens(beam) and get_unique_tokens() overloads, plus get_encoder_unique_tokens() returning optional VecUniqueTokens. Adjust the binding chain on use_draft_model to enable additional chained bindings.
  • KV Cache Manager Bindings (cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp): Add Python bindings for find_new_context_block(unique_tokens, llm_request) on BaseKVCacheManager and scheduling_has_free_blocks(num_required, window_size) on KVCacheManager, delegating to the underlying C++ implementations.
  • Scheduler Initialization & Configuration (tensorrt_llm/__init__.py, tensorrt_llm/_torch/pyexecutor/_util.py): Set the TLLM_USE_PYTHON_SCHEDULER=1 environment variable on startup. Add conditional logic in create_py_executor_instance to select SimpleUnifiedScheduler when the flag is enabled; otherwise retain the existing C++ scheduler selection logic.
  • Python Scheduling Framework (tensorrt_llm/_torch/pyexecutor/scheduler.py): Introduce a comprehensive Python-based scheduling system: PyCapacityScheduler (orchestrator with policy-based fitting), PyMicroBatchScheduler (encoder/context/generation batching), and SimpleUnifiedScheduler (composite runner). Add SchedulerPolicyBase with MaxRequestsPolicy, GuaranteedNoEvictPolicy, and MaxUtilizationPolicy implementations; block-tracking managers; a ChunkingPolicy enum; and state/prioritization logic mirroring the C++ behavior.
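The policy hierarchy described above can be sketched minimally. The class names SchedulerPolicyBase and MaxRequestsPolicy come from the change summary; the select method, its signature, and its behavior are assumptions made for illustration:

```python
from abc import ABC, abstractmethod


class SchedulerPolicyBase(ABC):
    """Interface each capacity policy implements (method name illustrative)."""

    @abstractmethod
    def select(self, requests, max_num_requests):
        """Return the subset of requests admitted this iteration."""


class MaxRequestsPolicy(SchedulerPolicyBase):
    """Toy policy: admit requests in arrival order up to a fixed count."""

    def select(self, requests, max_num_requests):
        return requests[:max_num_requests]
```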

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Executor as Executor
    participant Sched as SimpleUnifiedScheduler
    participant Capacity as PyCapacityScheduler
    participant MicroBatch as PyMicroBatchScheduler
    participant KVCache as KVCacheManager
    participant Policy as SchedulerPolicy

    Executor->>Sched: schedule(pending_requests, running_requests, kv_cache_manager)
    activate Sched
    
    Sched->>Capacity: schedule(pending, running, kv_cache_manager)
    activate Capacity
    
    Capacity->>Policy: get_new_request_ids(pending)
    activate Policy
    Policy->>Capacity: filtered_request_ids
    deactivate Policy
    
    loop For each candidate request
        Capacity->>KVCache: find_new_context_block(unique_tokens, request)
        KVCache->>Capacity: context_block_info
        Capacity->>Capacity: fit_request_to_blocks()
    end
    
    Capacity->>Sched: scheduled_requests, paused_requests
    deactivate Capacity
    
    Sched->>MicroBatch: schedule(scheduled_requests, kv_cache_manager)
    activate MicroBatch
    
    MicroBatch->>MicroBatch: compute_chunk_sizes()
    
    rect rgb(200, 220, 255)
        note right of MicroBatch: Encoder phase
        MicroBatch->>KVCache: scheduling_has_free_blocks()
        KVCache->>MicroBatch: has_free
    end
    
    rect rgb(220, 240, 220)
        note right of MicroBatch: Context phase
        MicroBatch->>MicroBatch: select_requests_for_context()
    end
    
    rect rgb(255, 240, 200)
        note right of MicroBatch: Generation phase
        MicroBatch->>MicroBatch: select_requests_for_generation()
    end
    
    MicroBatch->>Sched: SchedulerOutput (batches, tokens)
    deactivate MicroBatch
    
    Sched->>Executor: SchedulerOutput
    deactivate Sched
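The flow in the sequence diagram can be condensed into a runnable toy. All classes here are simplified stand-ins invented for illustration; only the scheduling_has_free_blocks method name and the capacity-then-micro-batch split come from the PR:

```python
class FakeKVCacheManager:
    """Toy KV-cache manager; only scheduling_has_free_blocks matches the PR's binding."""

    def __init__(self, free_blocks, tokens_per_block=64):
        self.free_blocks = free_blocks
        self.tokens_per_block = tokens_per_block

    def scheduling_has_free_blocks(self, num_required, window_size=None):
        return self.free_blocks >= num_required


class ToyCapacityScheduler:
    """Admits requests while their KV blocks still fit (guaranteed-no-evict flavor)."""

    def schedule(self, pending_lengths, kv):
        fitted, paused, used = [], [], 0
        for prompt_len in pending_lengths:
            needed = -(-prompt_len // kv.tokens_per_block)  # ceil division
            if kv.scheduling_has_free_blocks(used + needed):
                used += needed
                fitted.append(prompt_len)
            else:
                paused.append(prompt_len)
        return fitted, paused


class ToyMicroBatchScheduler:
    """Splits fitted requests into micro-batches of a fixed size."""

    def __init__(self, batch_size):
        self.batch_size = batch_size

    def schedule(self, fitted):
        return [fitted[i:i + self.batch_size]
                for i in range(0, len(fitted), self.batch_size)]


class ToyUnifiedScheduler:
    """Composite runner in the spirit of SimpleUnifiedScheduler."""

    def __init__(self, capacity, micro_batch):
        self.capacity = capacity
        self.micro_batch = micro_batch

    def schedule(self, pending_lengths, kv):
        fitted, paused = self.capacity.schedule(pending_lengths, kv)
        return self.micro_batch.schedule(fitted), paused
```

With 4 free blocks and three 100-token requests (2 blocks each), the first two fit and the third is paused, matching the fit-or-pause behavior the diagram describes.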

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 52.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check — ❓ Inconclusive: The PR description provides technical context and benchmark results but lacks key template sections such as the PR title format, a clear problem statement, test coverage details, and the completion checklist. Add a properly formatted title following the [JIRA/ticket][type] format, clearly state the problem/solution, explicitly list test cases, and complete the PR checklist items.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title clearly and specifically describes the main change: re-implementing MicroBatchScheduler and CapacityScheduler in Python, with the JIRA ticket properly referenced.


@tensorrt-cicd (Collaborator)

PR_Github #31953 [ run ] completed with state SUCCESS. Commit: 82fac4d
/LLM/main/L0_MergeRequest_PR pipeline #24754 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #31985 [ run ] triggered by Bot. Commit: 82fac4d

@tensorrt-cicd (Collaborator)

PR_Github #31985 [ run ] completed with state SUCCESS. Commit: 82fac4d
/LLM/main/L0_MergeRequest_PR pipeline #24778 completed with status: 'SUCCESS'

@Funatiq (Collaborator) left a comment


Great work. I haven't looked at the implementation in detail yet. Two suggestions:

  1. I think we should add a simple test to CI so that we don't break the functionality by accident. E.g. run test_overlap_scheduler.py with the Python scheduler too.
  2. Can we run a performance check on a smaller model like Llama-3.2-1B? The overhead should be more significant there. Ideally we should add NVTX ranges and collect nsys profiles to isolate the differences in execution time for the scheduler.

@lancelly (Collaborator, Author)

Great work. I haven't looked at the implementation in detail yet. Two suggestions:

  1. I think we should add a simple test to CI so that we don't break the functionality by accident. E.g. run test_overlap_scheduler.py with the Python scheduler too.
  2. Can we run a performance check on a smaller model like Llama-3.2-1B? The overhead should be more significant there. Ideally we should add NVTX ranges and collect nsys profiles to isolate the differences in execution time for the scheduler.

Thanks for the review! @Funatiq

  1. Will do.
  2. I'll run a performance check on Llama-3.2-1B. The time breakdown listed above was done by adding timer logs and also verified with nsys profiles.
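The "timer logs" approach mentioned above can be as simple as a context manager that records wall-clock durations per labeled range (all names here are invented for illustration, not the PR's actual instrumentation):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of measured durations in milliseconds
timings = defaultdict(list)


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append((time.perf_counter() - start) * 1000.0)
```

Wrapping the scheduler call in `with timed("_schedule"): ...` and averaging `timings["_schedule"]` afterward gives the kind of per-iteration breakdown quoted in the benchmarks, which can then be cross-checked against nsys NVTX ranges.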

@Funatiq (Collaborator)

Funatiq commented Jan 15, 2026

Since you already have nsys profiles, could you report what the runtime for only the _schedule range is in both cases?

@lancelly (Collaborator, Author)

lancelly commented Jan 16, 2026

Since you already have nsys profiles, could you report what the runtime for only the _schedule range is in both cases?

Sure, this image shows the result mentioned above (nsys reports different schedule times for each iteration, so we only recorded the averages/medians). Details are in: https://docs.google.com/document/d/1he4S6hzDBApMGp2Bl5PTED-hcKaRmXDZbOi9EgJlK5A/edit?tab=t.0
[Screenshot: nsys schedule-time comparison]

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #32429 [ run ] triggered by Bot. Commit: baeee83

@NVIDIA deleted two comments from tensorrt-cicd Jan 18, 2026
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #32433 [ run ] triggered by Bot. Commit: baeee83

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #32442 [ run ] triggered by Bot. Commit: fc794cb

@tensorrt-cicd (Collaborator)

PR_Github #32442 [ run ] completed with state SUCCESS. Commit: fc794cb
/LLM/main/L0_MergeRequest_PR pipeline #25131 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #32456 [ run ] triggered by Bot. Commit: fc794cb

@lancelly (Collaborator, Author)

@Funatiq Hi, I added an E2E benchmark on Llama-3.2-1B as suggested. The host overhead also seems acceptable on Llama-3.2-1B:

  • Config: GB200 + Llama-3.2-1B + Agg
  • Server: ADP2 + max_batch_size 1536
  • Client: max-concurrency 3072 + isl_1k_osl_1k + num_prompts 30720
  • Result
    • Output token throughput shows a gap of around 1.4% after 50 runs.
    • Schedule time accounts for approximately 4.45% of host_step_time when using the C++ scheduler. The average per-iteration host_step_time is 44.53 ms with the C++ scheduler and 45.31 ms with the Python scheduler, an increase of 1.75%.
[Screenshots: Llama-3.2-1B throughput and schedule-time results]

@tensorrt-cicd (Collaborator)

PR_Github #32456 [ run ] completed with state SUCCESS. Commit: fc794cb
/LLM/main/L0_MergeRequest_PR pipeline #25143 completed with status: 'SUCCESS'

@Funatiq (Collaborator)

Funatiq commented Jan 19, 2026

Thanks for the benchmarks. Could you add a short summary to the PR description please?

@lancelly (Collaborator, Author)

Thanks for the benchmarks. Could you add a short summary to the PR description please?

Sure, I have updated the PR description. I think this PR can be merged.

@lancelly (Collaborator, Author)

lancelly commented Jan 19, 2026

@eopXD @nvpohanh Hi, could you please review/approve this PR? Thanks!

@QiJune QiJune merged commit dbb858a into NVIDIA:main Jan 20, 2026
9 checks passed
